State-of-the-art object detectors are treated as black boxes due to their highly non-linear internal computations. Even with unprecedented advancements in detector performance, the inability to explain how their outputs are generated limits their use in safety-critical applications. Previous work fails to produce explanations for both bounding box and classification decisions, and generally produces one-off explanations for individual detectors. In this paper, we propose an open-source Detector Explanation Toolkit (DExT), which implements our approach to generating a holistic explanation for all detector decisions using gradient-based explanation methods. We suggest various multi-object visualization methods to merge the explanations of multiple objects detected in an image, together with the corresponding detections, into a single image. The quantitative evaluation shows that the Single Shot MultiBox Detector (SSD) is more faithfully explained than the other detectors regardless of the explanation method. Both quantitative and human-centric evaluations identify SmoothGrad with Guided Backpropagation (GBP) as the most trustworthy of the selected methods across all detectors. We expect that DExT will motivate practitioners to evaluate object detectors from the interpretability perspective by explaining both bounding box and classification decisions.
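To make the setting concrete, the sketch below illustrates the two ingredients named above, SmoothGrad's noise averaging and Guided Backpropagation's gradient clamping, for a generic detector. It is a minimal illustration, not DExT's implementation; `detector` and the score-selection interface are hypothetical placeholders.

```python
import torch

def guided_backprop_hooks(model):
    """Apply the GBP rule: pass only positive gradients through ReLUs."""
    def clamp(module, grad_in, grad_out):
        return (torch.clamp(grad_in[0], min=0.0),)
    for m in model.modules():
        if isinstance(m, torch.nn.ReLU):
            m.register_full_backward_hook(clamp)

def smoothgrad(detector, image, score_index, n_samples=25, sigma=0.15):
    """Average saliency over noisy copies of the input. `detector` is a
    hypothetical model returning a flat vector of detection scores;
    `score_index` picks one box-coordinate or class score to explain.
    """
    total = torch.zeros_like(image)
    for _ in range(n_samples):
        noisy = (image + sigma * torch.randn_like(image)).requires_grad_(True)
        detector(noisy).flatten()[score_index].backward()
        total += noisy.grad
    return total / n_samples
```

Explaining a box coordinate versus a class score is then just a matter of which `score_index` is selected, which is what makes a holistic explanation of all detector decisions possible.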
Trusting the predictions of deep learning models in safety-critical settings such as the medical domain is still not a viable option. Disentangled uncertainty quantification in the field of medical imaging has received little attention. In this paper, we study disentangled uncertainties in image-to-image translation tasks in the medical domain. We compare multiple uncertainty quantification methods, namely Ensembles, Flipout, Dropout, and DropConnect, while using CycleGAN to convert T1-weighted brain MRI scans to T2-weighted brain MRI scans. We further evaluate uncertainty behavior in the presence of out-of-distribution data (brain CT and RGB face images), showing that epistemic uncertainty can be used to detect out-of-distribution inputs, which should increase the reliability of model outputs.
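A common way to disentangle the two uncertainty types (a minimal sketch of the general recipe, not necessarily the paper's exact setup) is to sample a stochastic translator several times and split the predictive variance:

```python
import torch

def disentangled_uncertainty(model, x, T=20):
    """Split predictive variance into epistemic and aleatoric parts.

    Assumes `model` is stochastic at inference (e.g. Dropout/Flipout kept
    active) and returns a per-pixel mean and log-variance -- a common
    parameterization, not necessarily the paper's exact one.
    """
    model.train()  # keep stochastic layers active at test time
    mus, vars_ = [], []
    with torch.no_grad():
        for _ in range(T):
            mu, log_var = model(x)
            mus.append(mu)
            vars_.append(log_var.exp())
    mus, vars_ = torch.stack(mus), torch.stack(vars_)
    epistemic = mus.var(dim=0)      # spread across stochastic passes
    aleatoric = vars_.mean(dim=0)   # model-predicted data noise
    return mus.mean(dim=0), epistemic, aleatoric
```

The OOD result above corresponds to the epistemic term growing on inputs far from the training distribution, while the aleatoric term captures irreducible imaging noise.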
Safety-critical applications like autonomous driving use Deep Neural Networks (DNNs) for object detection and segmentation. DNNs fail when they observe an Out-of-Distribution (OOD) input, which can lead to catastrophic consequences. Existing OOD detection methods have been studied extensively for image inputs but have been explored far less for LiDAR inputs. In this study, we therefore propose two datasets for benchmarking OOD detection in 3D semantic segmentation. We use Maximum Softmax Probability and entropy scores generated by Deep Ensembles and Flipout versions of RandLA-Net as OOD scores. We observe that Deep Ensembles outperform the Flipout model in OOD detection, with higher AUROC scores on both datasets.
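Both scores are simple functions of the ensemble-averaged softmax output; a minimal sketch, assuming per-point class probabilities from each ensemble member:

```python
import numpy as np

def ood_scores(member_probs):
    """MSP- and entropy-based OOD scores from ensemble softmax outputs.

    member_probs: (M, N, C) array -- M ensemble members, N points
    (e.g. LiDAR points), C classes. Higher score -> more likely OOD.
    """
    mean_probs = member_probs.mean(axis=0)               # (N, C)
    msp = 1.0 - mean_probs.max(axis=-1)                  # 1 - max softmax prob
    entropy = -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=-1)
    return msp, entropy
```

AUROC is then computed by thresholding these scores against in-distribution versus OOD labels.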
Neural networks are increasingly used to make high-stakes predictions; meteorologists and hedge funds, for instance, apply these techniques to time series data. Machine learning models have certain limitations in prediction (such as lack of expressiveness, vulnerability to domain shift, and overconfidence) which can be addressed with uncertainty estimation. There is a set of expectations regarding how uncertainty should "behave". For instance, a wider prediction horizon should lead to more uncertainty, and a model's confidence should be proportional to its accuracy. In this paper, different uncertainty estimation methods are compared on forecasting meteorological time series data, and these expectations are evaluated. The results show how each uncertainty estimation method performs on the forecasting task, which partially evaluates the robustness of the predicted uncertainty.
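The first expectation, that a wider horizon implies more uncertainty, can be checked directly from an ensemble of forecasts; a minimal sketch (illustrative, not the paper's evaluation code):

```python
import numpy as np

def horizon_spread(ensemble_forecasts):
    """Per-step predictive spread from an ensemble of forecasts.

    ensemble_forecasts: shape (M, H) -- M models, H forecast steps.
    Under the stated expectation, the returned std should be roughly
    non-decreasing in the step index.
    """
    std_per_step = ensemble_forecasts.std(axis=0)
    monotone = bool(np.all(np.diff(std_per_step) >= -1e-9))
    return std_per_step, monotone
```

The second expectation (confidence proportional to accuracy) is a calibration property and is checked separately, e.g. with calibration curves.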
Overfitting and generalization are important concepts in machine learning, as only models that generalize are interesting for general applications. Yet some students have difficulty learning this important concept through lectures and exercises. In this paper, we describe common examples of students misunderstanding overfitting and offer recommendations for possible solutions. We cover student misconceptions about overfitting itself, about solutions to overfitting, and implementation errors that are commonly confused with overfitting problems. We hope our paper can help improve students' understanding of, and lectures about, this important topic.
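As a concrete anchor for the concept, a textbook heuristic for flagging overfitting from learning curves might look like the following (thresholds and names are illustrative, not from the paper):

```python
def looks_overfit(train_acc, val_acc, gap_threshold=0.1):
    """Heuristic: a large train/validation gap plus a validation score
    that has fallen from its peak suggests overfitting. An implementation
    error (e.g. leaking test data into training) can mimic or mask this
    signature, which is one of the confusions discussed above.
    """
    gap = train_acc[-1] - val_acc[-1]
    val_past_peak = val_acc[-1] < max(val_acc)
    return gap > gap_threshold and val_past_peak
```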
Modeling the trajectories generated by robot joints is complex, and such models are required for high-level activities like trajectory generation, clustering, and classification. Disentangled representation learning promises advances in unsupervised learning, but it has not yet been evaluated on robot-generated trajectories. In this paper, we evaluate three disentangling VAEs ($\beta$-VAE, Decorr VAE, and a new $\beta$-Decorr VAE) on a dataset of 1M robot trajectories generated from a 3 DoF robot arm. We find that the decorrelation-based formulations perform best in terms of disentanglement metrics, trajectory quality, and correlation with ground-truth latent features. We hope these results increase the use of unsupervised learning in robot control.
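For reference, the $\beta$-VAE objective that the evaluated models build on is the standard ELBO with a reweighted KL term; a minimal sketch (the Decorr variants add an explicit decorrelation penalty on the latents, omitted here):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """ELBO with a beta-weighted KL term: beta > 1 pushes the posterior
    toward the factorized prior, encouraging disentangled latents.
    """
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl
```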
Reinforcement learning (RL) based solutions are being adopted in a variety of domains including robotics, healthcare, and industrial automation. Most attention is given to cases where these solutions work well, but they fail when presented with out-of-distribution inputs; RL policies share the same faults as most machine learning models. In addition, out-of-distribution detection for RL is generally not well covered in the literature, and the task lacks benchmarks. In this work, we propose a benchmark to evaluate OOD detection methods in a reinforcement learning setting, by modifying the physical parameters of non-visual standard environments or corrupting the state observations of visual environments. We discuss ways to generate custom RL environments that can produce OOD data, and we evaluate three uncertainty methods on the OOD detection task. Our results show that ensemble methods have the best OOD detection performance, with a lower standard deviation across multiple environments.
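One way to realize the physical-parameter shift (our illustration; the benchmark's environments and scores may differ) is to perturb a constructor-level physics constant and score inputs by ensemble disagreement:

```python
import gymnasium as gym
import numpy as np

# Pendulum-v1 exposes gravity as a constructor argument in Gymnasium;
# other environments may need wrappers to inject parameter shifts.
env_train = gym.make("Pendulum-v1", g=10.0)  # in-distribution physics
env_ood = gym.make("Pendulum-v1", g=2.0)     # shifted gravity -> OOD dynamics

def disagreement_score(policies, obs):
    """One candidate OOD score: action disagreement across an ensemble
    of policies (illustrative; `policies` are hypothetical callables).
    """
    actions = np.stack([policy(obs) for policy in policies])
    return actions.std(axis=0).mean()

obs, _ = env_ood.reset(seed=0)
```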
Uncertainty quantification in neural networks promises to increase the safety of AI systems, but it is not clear how its performance varies with training set size. In this paper, we evaluate seven uncertainty methods on Fashion MNIST and CIFAR10 while subsampling to produce varied training set sizes. We find that calibration error and out-of-distribution detection performance strongly depend on training set size, with most methods being miscalibrated on the test set when trained on small training sets. Gradient-based methods seem to estimate epistemic uncertainty poorly and are the most affected by training set size. We hope our results can guide future research in uncertainty quantification and help practitioners select methods based on the data they have available.
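Calibration error here refers to the standard binned Expected Calibration Error; a minimal sketch of its computation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: population-weighted mean |accuracy - confidence|.

    confidences: (N,) max softmax probabilities; correct: (N,) booleans.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean()
                                       - confidences[in_bin].mean())
    return ece
```

Miscalibration on small training sets shows up as a large gap between per-bin confidence and accuracy, typically on the overconfident side.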
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries, and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
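The multiple-choice portion of such a benchmark reduces to exact-match accuracy; the human axes (factuality, harm, bias) require rater judgments and cannot be scored this way. A trivial sketch of the automatable part (hypothetical helper, not the paper's harness):

```python
def multiple_choice_accuracy(predictions, gold):
    """Exact-match accuracy over multiple-choice items, e.g. letter
    answers such as "B"."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```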
The mediocre performance of conventional federated learning (FL) over heterogeneous data has motivated personalized FL solutions, in which, unlike conventional FL, which trains a single global consensus model, different models are allowed for different clients. However, in most existing personalized FL algorithms, the collaborative knowledge across the federation is passed to the clients only implicitly, through means such as model aggregation or regularization. We observe that this implicit knowledge transfer fails to maximize the potential value of each client's empirical risk toward other clients. Based on this observation, we propose Personalized Global Federated Learning (PGFed), a novel personalized FL framework that enables each client to personalize its own global objective by explicitly and adaptively aggregating the empirical risks of itself and other clients. To avoid massive ($O(N^2)$) communication overhead and potential privacy leakage, each client's risk is estimated through a first-order approximation for other clients' adaptive risk aggregation. On top of PGFed, we develop a momentum upgrade, dubbed PGFedMo, to more efficiently utilize clients' empirical risks. Our extensive experiments under different federated settings with benchmark datasets show consistent improvements of PGFed over state-of-the-art alternatives.
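Our reading of the first-order trick (an illustration, not the authors' code): client $j$'s risk at client $i$'s model is approximated as $R_j(\theta_i) \approx R_j(\theta_j) + \nabla R_j(\theta_j)^\top(\theta_i - \theta_j)$, so only scalars and gradient vectors are exchanged instead of $O(N^2)$ model transfers:

```python
import torch

def personalized_objective(local_risk, theta_i, snapshots, alphas):
    """PGFed-style surrogate (hypothetical interface). `snapshots` holds
    (R_j value, grad R_j, theta_j) from other clients, with models
    flattened into 1-D parameter vectors so torch.dot applies.
    """
    obj = local_risk(theta_i)
    for a, (r_j, g_j, theta_j) in zip(alphas, snapshots):
        # First-order estimate of R_j(theta_i); no model exchange needed.
        obj = obj + a * (r_j + torch.dot(g_j, theta_i - theta_j))
    return obj
```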